Embedding Project 1
Project Goal
- This project aims to create a tool for searching podcast transcripts to easily revisit specific segments of episodes from a YouTube playlist.
- For instance, one of the podcasts I follow has 215 episodes, and manually searching for relevant sections in old episodes is time-consuming. This tool automates the process by enabling semantic searches across transcripts, returning precise video segments with timestamps. While the concept can apply to any video or audio with transcripts, this implementation focuses on YouTube playlist videos.
- Here is what the output of the program looks like:
» uv run main.py --name pt --search "the word cat only means something because it isn't the word cow"
Searching for: "the word cat only means something because it isn't the word cow" in collection: pt
1. Episode #115 ... Structuralism and Context. Confidence score = 0.64401466.
- https://www.youtube.com/watch?v=CZSrvFKGuC8&start=923&end=954
2. Episode #117 ... Structuralism and Mythology pt. 2. Confidence score = 0.6229924.
- https://www.youtube.com/watch?v=adlc1yY47xE&start=1193&end=1223
3. Episode #216 ... The Self-Overcoming of Nihilism - Kyoto School pt. 1 - Nishitani. Confidence score = 0.5024827.
- https://www.youtube.com/watch?v=eD_gi25iBIE&start=1494&end=1524
Tools Used
- Vector Database: Qdrant, for storage and retrieval of transcript embeddings.
- Text Embedding Model: SentenceTransformer (all-MiniLM-L6-v2), for generating semantic embeddings of transcript text.
- YouTube Video Downloader: yt-dlp, for downloading transcripts without video files.
- Subtitles Processing Library: webvtt, for parsing subtitle files (VTT format) with timestamps.
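These tools fit together around vector similarity: the embedding model turns each transcript chunk (and each search query) into a fixed-length vector, and Qdrant ranks chunks by cosine similarity to the query vector. A toy sketch of that ranking step with hand-made 3-dimensional vectors (the real all-MiniLM-L6-v2 model produces 384-dimensional vectors; the vectors and texts below are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings": semantically similar texts get nearby vectors.
chunks = {
    "the word cat only means something in contrast to other words": [0.9, 0.1, 0.2],
    "today we unbox a new mechanical keyboard": [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # pretend embedding of the search query

# Rank chunks by similarity to the query, highest first.
ranked = sorted(chunks, key=lambda t: cosine_similarity(query_vector, chunks[t]), reverse=True)
print(ranked[0])
```

In the real pipeline Qdrant performs this ranking internally; the point is only that "relevance" is a geometric comparison between vectors, not keyword matching.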
Process Overview
- The workflow involves downloading transcripts, processing them into searchable chunks, and enabling semantic searches over them via a vector database.
Step 1: Downloading Transcripts
- Tool: yt-dlp
- Process:
- Use yt-dlp to download transcripts from a YouTube playlist without downloading video files.
- Key yt-dlp options:
- --skip-download: Downloads only transcripts, not video files.
- --ignore-errors: Skips private or unavailable videos.
- --sub-format vtt: Downloads subtitles in WebVTT (.vtt) format, YouTube's default subtitle format (other formats such as SRT are also supported).
- The script supports multiple playlists, each identified by a unique collection name (i.e., the playlist name). If the same collection name is reused, previous data (folders, files, and database entries) is deleted to start fresh.
input_path = f"collections/{collection_name}/subtitles"  # where to store the downloaded subtitles
if os.path.isdir(input_path):
    print("Deleting existing directory")
    shutil.rmtree(input_path)

ydl_opts = {
    'skip_download': True,      # transcripts only, no video files
    'subtitlesformat': 'vtt',
    'subtitleslangs': ['en'],
    'writeautomaticsub': True,  # fall back to auto-generated captions
    'writesubtitles': True,
    'ignoreerrors': True,       # skip private/unavailable videos
    'paths': {
        'home': input_path
    }
}
try:
    # Create a YoutubeDL instance and download only the subtitles
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([load_url])
        print(f"yt_dlp error code {error_code}")
except yt_dlp.utils.DownloadError as e:
    print(f"Download failed: {e}")
Step 2: Processing Transcripts
- Library: webvtt
- Process:
- Read .vtt subtitle files using webvtt to extract transcript text and timestamps.
- Address duplicate subtitle entries (a common issue in YouTube transcripts) by storing unique lines as dictionary keys; because Python dicts preserve insertion order, this effectively simulates an ordered set, which Python lacks as a built-in data structure.
- Transform transcripts into searchable chunks:
- Chunking: Split transcripts into 30-second segments (configurable duration, chosen experimentally).
- Extracted Information:
- Transcript with Timestamps: Each chunk includes text and start/end timestamps.
- Video ID: Used to construct YouTube URLs with start/end times for direct access to the relevant segment.
- File Name: Acts as a grouping key to limit search results to one match per video, avoiding multiple results from the same episode.
print("Finished downloading subtitles, starting chunk process")
segment_point_in_second = 30  # chunk subtitles into 30-second windows
chunks = process_subtitle_file(input_path, segment_point_in_second)
print(f"Finished creating chunks for: {len(chunks)} files, next storing to db")
store_subtitle_data(
chunks=chunks,
collection_name=collection_name,
client=qdrant_client,
encoder=encoder
)
print("Finished storing.")
def process_subtitle_file(input_path, segment_time_second: int):
    chunks = defaultdict(list)
    for file_name in os.listdir(input_path):
        file_path = os.path.join(input_path, file_name)
        start_time = 0
        text = {}  # dict keys act as an ordered set for deduplication
        file_name_clean = clean_file_name(file_name)
        captions = webvtt.read(file_path)
        for caption in captions:
            end_time = time_to_seconds(caption.end)
            for t in caption.text.strip().split("\n"):
                text[t] = None
            # Close a chunk every segment_time_second seconds, or at the last caption
            if end_time - start_time >= segment_time_second or caption == captions[-1]:
                chunks[file_name_clean].append({
                    "file_name": file_name_clean,
                    "start": start_time,
                    "end": end_time,
                    "text": " ".join(text.keys()),
                    "video_id": get_youtube_id(file_name)
                })
                text = {}
                start_time = end_time
    return chunks
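The time_to_seconds helper used above is not shown. WebVTT timestamps look like 00:15:23.500, so a plausible stdlib-only implementation (hypothetical; the real helper may differ) is:

```python
def time_to_seconds(timestamp: str) -> int:
    """Convert a WebVTT 'HH:MM:SS.mmm' timestamp to whole seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(float(seconds))

print(time_to_seconds("00:15:23.500"))  # → 923
```

Truncating to whole seconds is enough here, since the values only feed YouTube's start/end URL parameters, which take integer seconds.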
Step 3: Searching
- Process:
- Accept a user query
- Search the Qdrant database to find the most relevant transcript chunks based on semantic similarity.
- Return formatted results, including:
- The matching transcript text.
- A YouTube URL with timestamps to jump directly to the relevant video segment.
- The video title or filename for context.
print(f"Searching for: '{search_query}' in collection: '{collection_name}'")
hits = initiate_rag_search(
    query=args.search,
    collection_name=collection_name,
    client=qdrant_client,
    encoder=encoder
)
counter = 1
for point_group in hits.groups:
    for hit in point_group.hits:
        metas = hit.payload
        file_name = metas["file_name"]
        start_time = metas["start"]
        end_time = metas["end"]
        video_id = metas["video_id"]
        print(f"{counter}. {file_name}. Confidence score = {hit.score}.")
        print(f"\t - https://www.youtube.com/watch?v={video_id}&start={start_time}&end={end_time}")
        counter += 1